European Radiology

Springer Science and Business Media LLC

All preprints, ranked by how well they match European Radiology's content profile, based on 14 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Comparison of the Diagnostic Performance from Patient's Medical History and Imaging Findings between GPT-4 based ChatGPT and Radiologists in Challenging Neuroradiology Cases

Horiuchi, D.; Tatekawa, H.; Oura, T.; Oue, S.; Walston, S. L.; Takita, H.; Matsushita, S.; Mitsuyama, Y.; Shimono, T.; Miki, Y.; Ueda, D.

2023-08-29 radiology and imaging 10.1101/2023.08.28.23294607 medRxiv
Top 0.1%
55.6%

Purpose: To compare the diagnostic performance between Chat Generative Pre-trained Transformer (ChatGPT), based on the GPT-4 architecture, and radiologists using patients' medical history and imaging findings in challenging neuroradiology cases. Methods: We collected 30 consecutive "Freiburg Neuropathology Case Conference" cases from the journal Clinical Neuroradiology between March 2016 and June 2023. GPT-4 based ChatGPT generated diagnoses from the provided medical history and imaging findings for each case, and its diagnostic accuracy rate was determined against the published ground truth. Three radiologists with different levels of experience (2, 4, and 7 years, respectively) independently reviewed all cases based on the same provided medical history and imaging findings, and their diagnostic accuracy rates were evaluated. Chi-square tests were performed to compare the diagnostic accuracy rates between ChatGPT and each radiologist. Results: ChatGPT achieved an accuracy rate of 23% (7/30 cases). The radiologists achieved the following accuracy rates: the junior radiology resident 27% (8/30), the senior radiology resident 30% (9/30), and the board-certified radiologist 47% (14/30). ChatGPT's diagnostic accuracy rate was lower than that of each radiologist, although the differences were not significant (p = 0.99, 0.77, and 0.10, respectively). Conclusion: The diagnostic performance of GPT-4 based ChatGPT did not reach the level of either junior/senior radiology residents or board-certified radiologists in challenging neuroradiology cases. While ChatGPT holds great promise in neuroradiology, radiologists should be aware of its current performance and limitations for optimal utilization.
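
The accuracy comparison described above reduces to a chi-square test on a 2x2 table of correct versus incorrect diagnoses per reader. A minimal sketch in Python, using the counts quoted in the abstract (7/30 for ChatGPT versus 14/30 for the board-certified radiologist) but not taken from the study itself:

    # Hypothetical sketch: chi-square comparison of two diagnostic accuracy rates,
    # using the counts quoted in the abstract. Not the authors' actual analysis code.
    import numpy as np
    from scipy.stats import chi2_contingency

    chatgpt = (7, 30)        # (correct, total)
    radiologist = (14, 30)

    table = np.array([
        [chatgpt[0], chatgpt[1] - chatgpt[0]],              # correct / incorrect
        [radiologist[0], radiologist[1] - radiologist[0]],
    ])
    chi2, p, dof, expected = chi2_contingency(table)         # Yates correction is SciPy's default for 2x2
    print(f"chi2 = {chi2:.2f}, p = {p:.2f}")                 # ~0.10 for these counts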

2
Comparison of the diagnostic accuracy among GPT-4 based ChatGPT, GPT-4V based ChatGPT, and radiologists in musculoskeletal radiology

Horiuchi, D.; Tatekawa, H.; Oura, T.; Shimono, T.; Walston, S. L.; Takita, H.; Matsushita, S.; Mitsuyama, Y.; Miki, Y.; Ueda, D.

2023-12-09 radiology and imaging 10.1101/2023.12.07.23299707 medRxiv
Top 0.1%
55.1%

Objective: To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4 based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in musculoskeletal radiology. Materials and Methods: We included 106 "Test Yourself" cases from Skeletal Radiology between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4 based ChatGPT and the medical history and images into GPT-4V based ChatGPT, and both generated a diagnosis for each case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. Diagnostic accuracy rates were determined based on the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4 based ChatGPT, GPT-4V based ChatGPT, and the radiologists. Results: GPT-4 based ChatGPT significantly outperformed GPT-4V based ChatGPT (p < 0.001), with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106), respectively. The diagnostic accuracy of GPT-4 based ChatGPT was comparable to that of the radiology resident but lower than that of the board-certified radiologist, although the differences were not significant (p = 0.78 and 0.22, respectively). The diagnostic accuracy of GPT-4V based ChatGPT was significantly lower than that of both radiologists (p < 0.001 for each). Conclusion: GPT-4 based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V based ChatGPT. While GPT-4 based ChatGPT's diagnostic performance was comparable to that of the radiology resident, it did not reach the level of the board-certified radiologist in musculoskeletal radiology.

3
Comparative Analysis of ChatGPT's Diagnostic Performance with Radiologists Using Real-World Radiology Reports of Brain Tumors

Mitsuyama, Y.; Tatekawa, H.; Takita, H.; Sasaki, F.; Tashiro, A.; Satoshi, O.; Walston, S. L.; Miki, Y.; Ueda, D.

2023-10-28 radiology and imaging 10.1101/2023.10.27.23297585 medRxiv
Top 0.1%
54.7%

Background: Large language models like Chat Generative Pre-trained Transformer (ChatGPT) have demonstrated potential for differential diagnosis in radiology. Previous studies investigating this potential primarily utilized quizzes from academic journals, which may not accurately represent real-world clinical scenarios. Purpose: This study aimed to assess the diagnostic capabilities of ChatGPT using actual clinical radiology reports of brain tumors and compare its performance with that of neuroradiologists and general radiologists. Methods: We consecutively collected brain MRI reports from preoperative brain tumor patients at Osaka Metropolitan University Hospital from January to December 2021. ChatGPT and five radiologists were presented with the same findings from the reports and asked to suggest differential and final diagnoses. The pathological diagnosis of the excised tumor served as the ground truth. Chi-square tests and Fisher's exact test were used for statistical analysis. Results: Across 99 radiological reports, ChatGPT achieved a final diagnostic accuracy of 75% (95% CI: 66, 83%), while the radiologists' accuracy ranged from 64% to 82%. ChatGPT's final diagnostic accuracy was higher with reports from neuroradiologists, at 82% (95% CI: 71, 89%), than with those from general radiologists, at 52% (95% CI: 33, 71%) (p = 0.012). For differential diagnoses, ChatGPT's accuracy was 95% (95% CI: 91, 99%), while the radiologists' accuracy fell between 74% and 88%. Notably, for differential diagnoses, ChatGPT's accuracy remained consistent whether the reports came from neuroradiologists (96%, 95% CI: 89, 99%) or general radiologists (91%, 95% CI: 73, 98%) (p = 0.33). Conclusion: ChatGPT exhibited good diagnostic capability, comparable to neuroradiologists, in differentiating brain tumors from MRI reports. ChatGPT can serve as a second opinion for neuroradiologists on final diagnoses and as a guidance tool for general radiologists and residents, especially for understanding diagnostic cues and handling challenging cases. Summary: This study evaluated ChatGPT's diagnostic capabilities using real-world clinical MRI reports from brain tumor cases, revealing that its accuracy in interpreting brain tumors from MRI findings is competitive with radiologists. Key results: (1) ChatGPT demonstrated a diagnostic accuracy rate of 75% for final diagnoses based on preoperative MRI findings from 99 brain tumor cases, competing favorably with five radiologists whose accuracies ranged between 64% and 82%; for differential diagnoses, ChatGPT achieved 95% accuracy, outperforming several of the radiologists. (2) Radiology reports from neuroradiologists and general radiologists yielded different accuracy when input into ChatGPT: reports from neuroradiologists resulted in higher accuracy for final diagnoses, while there was no difference in accuracy for differential diagnoses.

4
Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4(V) in Challenging Brain MRI Cases

Schramm, S.; Preis, S.; Metz, M.-C.; Jung, K.; Schmitz-Koep, B.; Zimmer, C.; Wiestler, B.; Hedderich, D. M.; Kim, S. H.

2024-03-06 radiology and imaging 10.1101/2024.03.05.24303767 medRxiv
Top 0.1%
43.8%

Background: Recent studies have explored the application of multimodal large language models (LLMs) in radiological differential diagnosis. Yet, how different multimodal input combinations affect diagnostic performance is not well understood. Purpose: To evaluate the impact of varying multimodal input elements on the accuracy of GPT-4(V)-based brain MRI differential diagnosis. Methods: Thirty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with variations of four input elements (image, image annotation, medical history, image description) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (PerplexityAI, powered by GPT-4(V)). Accuracy of LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a chi-square test and a Kruskal-Wallis test. Results were corrected for false discovery rate employing the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each individual input element to diagnostic performance. Results: The prompt group containing an annotated image, medical history, and image description as input exhibited the highest diagnostic accuracy (67.8% correct responses). Significant differences were observed between prompt groups, especially between groups that contained the image description among their inputs and those that did not. Regression analyses confirmed a large positive effect of the image description on diagnostic accuracy (p << 0.001), as well as a moderate positive effect of the medical history (p < 0.001). The presence of unannotated or annotated images had only minor or insignificant effects on diagnostic accuracy. Conclusion: The textual description of radiological image findings was identified as the strongest contributor to performance of GPT-4(V) in brain MRI differential diagnosis, followed by the medical history. The unannotated or annotated image alone yielded very low diagnostic performance. These findings offer guidance on the effective utilization of multimodal LLMs in clinical practice.
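
The Benjamini-Hochberg step mentioned above controls the false discovery rate across the pairwise prompt-group comparisons. A minimal sketch of that correction with statsmodels, using placeholder p-values rather than values from the study:

    # Illustrative sketch of Benjamini-Hochberg false-discovery-rate correction
    # across pairwise prompt-group comparisons. The p-values are invented for the
    # example; only the procedure mirrors the one named in the abstract.
    from statsmodels.stats.multitest import multipletests

    raw_p = [0.001, 0.004, 0.03, 0.20, 0.45, 0.78]   # hypothetical pairwise test p-values
    reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
    for p, q, r in zip(raw_p, p_adj, reject):
        print(f"raw p = {p:.3f} -> BH-adjusted p = {q:.3f}, significant: {r}")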

5
Artificial intelligence-generated smart impression from 9.8-million radiology reports as training datasets from multiple sites and imaging modalities

Kaviani, P.; Kalra, M. K.; Digumarthy, S. R.; Rodriguez, K.; Agarwal, S.; Brooks, R.; En, S.; Alkasab, T.; Bizzo, B. C.; Dreyer, K. J.

2024-03-09 radiology and imaging 10.1101/2024.03.07.24303787 medRxiv
Top 0.1%
43.2%

Importance: Automatic generation of the impression section of radiology reports can help make radiologists more efficient and avoid reporting errors. Objective: To evaluate the relationship, content, and accuracy of Powerscribe Smart Impression (PSI) against the radiologists' reported findings and impression (RDF). Design, Setting, and Participants: This institutional review board-approved retrospective study developed and trained a PSI algorithm (Nuance Communications, Inc.) with 9.8 million radiology reports from multiple sites to generate PSI based on information including the protocol name and the radiologist-dictated findings section of radiology reports. Three radiologists assessed 3879 radiology reports of multiple imaging modalities from 8 US imaging sites. For each report, we assessed whether PSI could accurately reproduce the RDF in terms of the number of clinically significant findings and the radiologists' style of reporting, while avoiding potential mismatch (with the findings section in terms of size, location, or laterality). Separately, we recorded the word count for PSI and RDF. Data were analyzed with Pearson correlation and paired t-tests. Main Outcomes and Measures: The data were ground-truthed by three radiologists. Each radiologist recorded the frequency of incidental/significant findings, any inconsistency between the RDF and PSI, and the stylistic and overall evaluation of PSI. Area under the curve (AUC), correlation coefficient, and percentages were calculated. Results: PSI reports were deemed either perfect (91.9%) or acceptable (7.68%) for stylistic concurrence with RDF. PSI (a mismatched Haller's index) and RDF (a mismatched nodule size) had one mismatch each. There was no difference between the word counts of PSI (mean 33 ± 23 words/impression) and RDF (mean 35 ± 24 words/impression) (p > 0.1). Overall, there was an excellent correlation (r = 0.85) between PSI and RDF for the evolution of findings (negative vs. stable vs. new or increasing vs. resolved or decreasing findings). The PSI outputs requiring major changes (2%) pertained to reports with multiple impression items. Conclusion and Relevance: In clinical settings of radiology exam interpretation, the Powerscribe Smart Impression assessed in our study can save interpretation time; a comprehensive findings section results in the best PSI output.

6
The Expertise Paradox: Who Benefits from LLM-Assisted Brain MRI Differential Diagnosis?

Schramm, S.; Le Guellec, B.; Topka, M.; Svec, M.; Backhaus, P.; Eisenkolb, V. M.; Riedel, E. O.; Beyrle, M.; Platzek, P.-S.; Ramschütz, C.; Paprottka, K. J.; Renz, M.; Bodden, J.; Kirschke, J. S.; Ziegelmeyer, S.; Busch, F.; Makowski, M. R.; Adams, L. C.; Bressem, K. K.; Hedderich, D. M.; Wiestler, B.; Kim, S. H.

2025-10-28 radiology and imaging 10.1101/2025.10.28.25338816 medRxiv
Top 0.1%
42.6%

Purpose: To evaluate how reader experience influences the diagnostic benefit from LLM assistance in brain MRI differential diagnosis. Materials and Methods: Neuroradiologists (n = 4), radiology residents (n = 4), and neurology/neurosurgery residents (n = 4) were recruited. A dataset of complex brain MRI cases was curated from the local imaging database (n = 40). For each case, readers provided a textual description of the main imaging finding and their top three differential diagnoses ("Unassisted"). Three state-of-the-art large language models (GPT-4.1, Gemini 2.5 Pro, DeepSeek-R1) were prompted to generate top-three differentials based on the clinical case description and reader-specific findings. Readers then revised their differential diagnoses after reviewing GPT-4.1 suggestions ("Assisted"). To evaluate the association between reader experience and diagnostic benefit, a cumulative link mixed model (CLMM) was fitted, with change in diagnostic result as ordinal outcome, reader experience as predictor, and random intercepts for rater and case. Results: LLM-generated differential diagnoses achieved the highest top-3 accuracy when provided with image descriptions from neuroradiologists (top-3: 78.8-83.8%), followed by radiology residents (top-3: 71.8-77.6%) and neurology/neurosurgery residents (top-3: 62.6-64.5%). In contrast, mean relative gains in top-3 accuracy through LLM assistance diminished with increasing experience, with +19.2% for neurology/neurosurgery residents (from 43.2% to 62.6%), +14.7% for radiology residents (from 59.6% to 74.4%), and +4.4% for neuroradiologists (from 83.1% to 87.5%). The CLMM demonstrated a significant negative association between reader experience and diagnostic benefit from LLM assistance (β = -0.10, p = 0.005). Conclusion: With increasing reader experience, absolute diagnostic LLM performance with reader-generated input improved, while relative diagnostic gains through LLM assistance paradoxically diminished. Our findings call attention to the divergence between standalone LLM performance and clinically relevant reader benefit, and emphasize the need to account for human-AI interaction in this context.
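
The central contrast in this abstract is between absolute top-3 accuracy and the assisted-versus-unassisted gain for each reader group. A short sketch of that arithmetic on the percentages quoted above (small differences from the reported gains of +19.2/+14.7/+4.4 points reflect rounding of the quoted figures; this is not the study's code):

    # Sketch: gain in top-3 accuracy from LLM assistance per reader group,
    # using the percentages quoted in the abstract.
    groups = {
        "neurology/neurosurgery residents": (43.2, 62.6),   # (unassisted %, assisted %)
        "radiology residents":              (59.6, 74.4),
        "neuroradiologists":                (83.1, 87.5),
    }
    for name, (unassisted, assisted) in groups.items():
        gain = assisted - unassisted                         # minor rounding vs. reported values
        print(f"{name}: {unassisted:.1f}% -> {assisted:.1f}% (gain {gain:+.1f} points)")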

7
Accuracy And Generalizability of an Open-Source Deep Learning Model For Facial Bone Segmentation on CT and CBCT Scans

Gkantidis, N.; Ghamri, M.; DOT, G.

2025-12-29 dentistry and oral medicine 10.64898/2025.12.28.25343101 medRxiv
Top 0.1%
40.2%

Aim: To evaluate the accuracy and generalizability of DentalSegmentator, an open-source deep learning tool, for automated segmentation of skeletal facial surfaces from computed tomography (CT) scans acquired under different imaging conditions. Materials and Methods: Ten human skulls were scanned using a CT scanner and three cone beam CT (CBCT) protocols (including an ultra-low-dose protocol) on two CBCT devices. High-accuracy reference surface models were acquired using an optical scanner. CBCT and CT scans were segmented automatically using DentalSegmentator. Three facial regions (forehead, zygomatic process, maxillary process) were defined on each model for quantitative assessment. Accuracy was measured as the mean absolute distance (MAD) and the standard deviation of absolute distances (SDAD) between segmented and reference models after best-fit superimposition. Results: Repeated segmentations were identical, confirming perfect reproducibility. Across all acquisition settings and regions, DentalSegmentator produced highly accurate skeletal surface models, with an overall MAD of 0.088 mm (IQR 0.073) and SDAD of 0.061 mm (IQR 0.028). Significant but small differences were detected between imaging systems (MAD: p < 0.001; SDAD: p = 0.003), with CT scans showing slightly reduced trueness compared with CBCT images. Conclusion: The open-source DentalSegmentator tool produced accurate skeletal facial surface segmentations across diverse CT and CBCT settings, demonstrating excellent generalizability, including under low-radiation conditions. Minor differences in trueness between imaging systems were small and unlikely to impact clinical or research use. Clinical Significance: Deep learning offers a robust foundation for automated 3D craniofacial surface extraction, supporting broader adoption of AI-driven workflows in both clinical and research contexts.
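
MAD and SDAD as used here are nearest-neighbour surface-distance statistics computed after superimposition. A minimal sketch with SciPy, using randomly generated points in place of real mesh vertices and assuming the two surfaces are already registered (not the authors' pipeline):

    # Sketch: mean absolute distance (MAD) and standard deviation of absolute
    # distances (SDAD) between two already-superimposed surface point clouds.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    reference = rng.random((5000, 3))                               # reference (optical-scan) points, in mm
    segmented = reference + rng.normal(0, 0.05, reference.shape)    # simulated segmented surface

    tree = cKDTree(reference)
    distances, _ = tree.query(segmented)    # absolute distance of each segmented point to nearest reference point
    mad = distances.mean()
    sdad = distances.std()
    print(f"MAD = {mad:.3f} mm, SDAD = {sdad:.3f} mm")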

8
Empowering Radiologists with ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases

Cesur, T.; Gunes, Y. C.; Camur, E.; Dagli, M.

2024-06-25 radiology and imaging 10.1101/2024.06.25.24309247 medRxiv
Top 0.1%
39.7%

Purpose: This study evaluated the diagnostic accuracy and differential diagnosis capabilities of 12 large language models (LLMs), one cardiac radiologist, and three general radiologists in cardiac radiology. The impact of ChatGPT-4o assistance on radiologist performance was also investigated. Materials and Methods: We collected 80 publicly available "Cardiac Case of the Month" cases from the Society of Thoracic Radiology website. LLMs and Radiologist-III were provided with text-based information, whereas the other radiologists visually assessed the cases with and without ChatGPT-4o assistance. Diagnostic accuracy and differential diagnosis scores (DDx Score) were analyzed using the chi-square, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests. Results: The unassisted diagnostic accuracy of the cardiac radiologist was 72.5%, General Radiologist-I was 53.8%, and General Radiologist-II was 51.3%. With ChatGPT-4o, accuracy improved to 78.8%, 70.0%, and 63.8%, respectively. The improvements for General Radiologists I and II were statistically significant (P ≤ 0.006). All radiologists' DDx Scores improved significantly with ChatGPT-4o assistance (P ≤ 0.05). Remarkably, Radiologist-I's GPT-4o-assisted diagnostic accuracy and DDx Score were not significantly different from the cardiac radiologist's unassisted performance (P > 0.05). Among the LLMs, Claude 3.5 Sonnet and Claude 3 Opus had the highest accuracy (81.3%), followed by Claude 3 Sonnet (70.0%). Regarding the DDx Score, Claude 3 Opus outperformed all models and Radiologist-III (P < 0.05). The accuracy of General Radiologist-III significantly improved from 48.8% to 63.8% with GPT-4o assistance (P < 0.001). Conclusion: ChatGPT-4o may enhance the diagnostic performance of general radiologists for cardiac imaging, suggesting its potential as a valuable diagnostic support tool. Further research is required to assess its clinical integration.

9
A survey of Paediatric Radiology Artificial Intelligence

Kelly, B. S.; Clifford, S.; Judge, C.; Bollard, S. M.; Healy, G. M.; Hughes, H.; Colleran, G. C.; Rod, J. E.; Mathur, P.; Prolo, L. M.; Lee, E. H.; Yeom, K. W.; Lawlor, A.; Killeen, R. P.

2024-09-22 radiology and imaging 10.1101/2024.09.19.24313885 medRxiv
Top 0.1%
39.0%

Background: Artificial intelligence (AI) applications in paediatric radiology present unique challenges due to diverse anatomy and physiology across age groups. Advancements in AI algorithms, particularly deep learning techniques, show promise in improving diagnostic accuracy. Objectives: To survey trends in AI research in paediatric radiology. To evaluate use cases, tasks, research methodologies and underlying data. To identify potential biases and future directions. Methods: A systematic search of paediatric radiology AI studies published from 2015 to 2021 was conducted following the PRISMA guidelines and the Cochrane Collaboration Handbook. The search included papers utilizing AI techniques for radiological diagnosis or intervention in patients aged under 18. Narrative synthesis was used due to methodological heterogeneity. Results: A total of 292 articles were included, with an increasing annual trend in the number of published articles. Neuroradiology and musculoskeletal radiology were the most common subspecialties. MRI was the dominant imaging modality, with segmentation and classification as the most common tasks. Retrospective cohort studies constituted the majority of research designs. Data quality and quantity varied, as did the choice of research design, data sources, and evaluation metrics. Conclusions: AI literature in paediatric radiology shows rapid growth, with advancements in various subspecialties and tasks. However, potential biases and data quality issues highlight the need for rigorous research design and evaluation to ensure the generalisability and reliability of AI models in clinical practice. Future research should focus on addressing these biases and improving the robustness of AI applications in paediatric radiology.

10
A retrospective analysis of the diagnostic performance of an FDA approved software for the detection of intracranial hemorrhage

Pourmussa, B.; Gorovoy, D.

2023-11-03 radiology and imaging 10.1101/2023.11.02.23297974 medRxiv
Top 0.1%
38.4%

Objective: To determine the sensitivity, specificity, accuracy, positive predictive value (PPV), and negative predictive value (NPV) of Rapid ICH, a commercially available AI model, in detecting intracranial hemorrhage (ICH) on non-contrast computed tomography (NCCT) examinations of the head at a single regional medical center. Methods: RapidAI's Rapid ICH is incorporated into the real-time hospital workflow to assist radiologists in the identification of ICH on NCCT examinations of the head. 412 examinations from August 2022 to January 2023 were pulled for analysis. Scans in which it was unclear whether ICH was present, as well as scans significantly affected by motion artifact, were excluded from the study. The sensitivity, specificity, accuracy, PPV, and NPV of the software were then assessed retrospectively for the remaining 406 NCCT examinations, using the prior radiologist report as the ground truth. A two-tailed z-test with α = 0.05 was performed to determine whether the sensitivity and specificity of the software in this study were significantly different from Rapid ICH's reported sensitivity and specificity. Additionally, the software's performance was analyzed separately for the male and female populations, and a chi-square test of independence was used to determine whether model correctness significantly depended on sex. Results: Of the 406 scans assessed, Rapid ICH flagged 82 ICH-positive cases and 324 ICH-negative cases. There were 80 examinations (19.7%) truly positive for ICH and 326 examinations (80.3%) negative for ICH. This resulted in a sensitivity of 71.3%, 95% CI [61.3%-81.2%], a specificity of 92.3%, 95% CI [89.4%-95.2%], an accuracy of 88.2%, 95% CI [85.0%-91.3%], a PPV of 69.5%, 95% CI [59.5%-79.5%], and an NPV of 92.9%, 95% CI [90.1%-95.7%]. Two examinations were excluded due to no existing information on patient sex in the electronic medical record. The resulting sensitivity was significantly different from the sensitivity reported by Rapid ICH (95%), z = 2.60, p = .009, although the resulting specificity was not significantly different from the specificity reported by Rapid ICH (94%), z = 0.65, p = .517. Model performance did not depend on sex per the chi-square test of independence: χ²(1, N = 404) = 1.95, p = .162. Conclusion: Rapid ICH demonstrates exceptional capability in the identification of ICH, but its performance at this site differs from the values advertised by the company and from assessments of the model's performance by other research groups. Specifically, the sensitivity of the software at this site is significantly different from the sensitivity reported by the company. These results underscore the necessity for independent evaluation of the software at institutions where it is implemented.
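
The diagnostic metrics and their 95% confidence intervals reported above follow from a 2x2 confusion matrix. A minimal sketch using Wald (normal-approximation) intervals, with counts back-calculated from the percentages in the abstract rather than taken from the paper's data:

    # Sketch: sensitivity, specificity, PPV, NPV, and accuracy with Wald 95% CIs
    # from a 2x2 confusion matrix. Counts are reconstructed from the abstract's
    # percentages (57/80 ~ 71.3%, 301/326 ~ 92.3%), not from the study dataset.
    import math

    TP, FN, FP, TN = 57, 23, 25, 301

    def proportion_ci(k, n, z=1.96):
        """Point estimate and Wald 95% CI for the proportion k/n."""
        p = k / n
        half = z * math.sqrt(p * (1 - p) / n)
        return p, max(0.0, p - half), min(1.0, p + half)

    metrics = {
        "sensitivity": (TP, TP + FN),
        "specificity": (TN, TN + FP),
        "PPV":         (TP, TP + FP),
        "NPV":         (TN, TN + FN),
        "accuracy":    (TP + TN, TP + FN + FP + TN),
    }
    for name, (k, n) in metrics.items():
        p, lo, hi = proportion_ci(k, n)
        print(f"{name}: {p:.1%} (95% CI {lo:.1%}-{hi:.1%})")   # in line with the abstract's figures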

11
A Comparative Study: Diagnostic Performance of ChatGPT 3.5, Google Bard, Microsoft Bing, and Radiologists in Thoracic Radiology Cases

Gunes, Y. C.; Cesur, T.

2024-01-20 radiology and imaging 10.1101/2024.01.18.24301495 medRxiv
Top 0.1%
38.4%

Purpose: To investigate and compare the diagnostic performance of ChatGPT 3.5, Google Bard, Microsoft Bing, and two board-certified radiologists in thoracic radiology cases published by the Society of Thoracic Radiology. Materials and Methods: We collected 124 "Case of the Month" cases from the Society of Thoracic Radiology website between March 2012 and December 2023. Medical history and imaging findings were input into ChatGPT 3.5, Google Bard, and Microsoft Bing for diagnosis and differential diagnosis. Two board-certified radiologists provided their diagnoses. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or non-specific for radiological diagnosis. Diagnostic accuracy and differential diagnosis scores were analyzed using chi-square, Kruskal-Wallis, and Mann-Whitney U tests. Results: Among the 124 cases, ChatGPT demonstrated the highest diagnostic accuracy (53.2%), outperforming the radiologists (52.4% and 41.1%), Bard (33.1%), and Bing (29.8%). Specific cases revealed varying diagnostic accuracies, with Radiologist I achieving 65.6%, surpassing ChatGPT (63.5%), Radiologist II (52.0%), Bard (39.5%), and Bing (35.4%). ChatGPT 3.5 and Bing had higher differential scores in specific cases (P < 0.05), whereas Bard did not (P = 0.114). All three had higher diagnostic accuracy in specific cases (P < 0.05). No differences were found in diagnostic accuracy or differential diagnosis scores across the four anatomical locations (P > 0.05). Conclusion: ChatGPT 3.5 demonstrated higher diagnostic accuracy than Bing, Bard, and the radiologists in text-based thoracic radiology cases. Large language models hold great promise in this field under proper medical supervision.

12
Total and Stroke Related Imaging Utilization Patterns During the COVID-19 Pandemic

Tu, L. H.; Sharma, R.; Malhotra, A.; Schindler, J. L.; Forman, H. P.

2020-05-26 radiology and imaging 10.1101/2020.05.20.20078915 medRxiv
Top 0.1%
38.3%

During the COVID-19 pandemic, radiology practices are reporting a decrease in imaging volumes. We review total imaging volume, CTA head and neck volume, critical results rate, and stroke intervention rates before and during the COVID-19 pandemic. Total imaging volume as well as CTA head and neck imaging fell by approximately 60% since the beginning of the pandemic. Critical results fell 60-70% for total imaging as well as for CTA head and neck. Compared with the same time frame a year prior, the number of stroke codes during the early impact of the pandemic decreased by approximately 50%. Proportional reductions in total imaging volume, stroke-related imaging, and associated critical result reports during the COVID-19 pandemic raise concern for missed stroke diagnoses in our population.

13
Prompt Engineering Strategies Improve the Diagnostic Accuracy of GPT-4 Turbo in Neuroradiology Cases

Wada, A.; Akashi, T.; Shih, G.; Hagiwara, A.; Nishizawa, M.; Hayakawa, Y.; Kikuta, J.; Shimoji, K.; Sano, K.; Kamagata, K.; Nakanishi, A.; Aoki, S.

2024-05-01 radiology and imaging 10.1101/2024.04.29.24306583 medRxiv
Top 0.1%
37.8%

Background: Large language models (LLMs) like GPT-4 demonstrate promising capabilities in medical image analysis, but their practical utility is hindered by substantial misdiagnosis rates ranging from 30-50%. Purpose: To improve the diagnostic accuracy of GPT-4 Turbo in neuroradiology cases using prompt engineering strategies, thereby reducing misdiagnosis rates. Materials and Methods: We employed 751 publicly available neuroradiology cases from the American Journal of Neuroradiology Case of the Week Archives. Prompt instructions guided GPT-4 Turbo to analyze clinical and imaging data, generating a list of five candidate diagnoses with confidence levels. Strategies included role adoption as an imaging expert, step-by-step reasoning, and confidence assessment. Results: Without any adjustments, the baseline accuracy of GPT-4 Turbo in correctly identifying the top diagnosis was 55.1%, with a misdiagnosis rate of 29.4%. Considering all five candidate diagnoses improved the applicability to 70.6%. Applying a 90% confidence threshold increased the accuracy of the top diagnosis to 72.9% and the applicability of the five candidates to 85.9%, while reducing misdiagnoses to 14.1%, but limited the analysis to half of the cases. Conclusion: Prompt engineering strategies with confidence-level thresholds demonstrated the potential to reduce misdiagnosis rates in neuroradiology cases analyzed by GPT-4 Turbo. This research paves the way for enhancing the feasibility of AI-assisted diagnostic imaging, where AI suggestions can contribute to human decision-making processes. However, the study lacks analysis of real-world clinical data. This highlights the need for further investigation across specialties and medical modalities to optimize thresholds that balance diagnostic accuracy and practical utility.
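
The confidence-threshold strategy described above trades applicability for accuracy: only cases whose top diagnosis clears the threshold are answered. A toy sketch of that selection logic, with invented case records rather than study data:

    # Toy sketch of the confidence-threshold idea: answer only cases where the
    # model's self-reported confidence in its top diagnosis clears a cut-off,
    # then report accuracy among answered cases plus the fraction answered.
    cases = [                                    # invented records for illustration
        {"top_correct": True,  "confidence": 0.95},
        {"top_correct": False, "confidence": 0.60},
        {"top_correct": True,  "confidence": 0.92},
        {"top_correct": False, "confidence": 0.91},
        {"top_correct": True,  "confidence": 0.40},
    ]

    def threshold_performance(cases, cutoff=0.90):
        answered = [c for c in cases if c["confidence"] >= cutoff]
        if not answered:
            return 0.0, 0.0
        accuracy = sum(c["top_correct"] for c in answered) / len(answered)
        coverage = len(answered) / len(cases)
        return accuracy, coverage

    acc, cov = threshold_performance(cases)
    print(f"accuracy on answered cases: {acc:.0%}, cases answered: {cov:.0%}")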

14
Can GPT-4 suggest the optimal sequence for brain magnetic resonance imaging?

Suzuki, K.; Abe, K.; Sakai, S.

2024-08-02 radiology and imaging 10.1101/2024.07.31.24311123 medRxiv
Top 0.1%
37.8%

Purpose: This study aimed to evaluate the potential of GPT-4, a large language model, in assisting radiologists to determine brain magnetic resonance imaging (MRI) protocols. Methods: We used brain MRI protocols from a specific hospital, covering 20 diseases or examination purposes and excluding brain tumor protocols. GPT-4 was given system prompts to add one MRI sequence to the basic brain MRI protocol, and disease names were input as user prompts. The model's suggestions were evaluated by two radiologists with over 20 years of relevant experience. Suggestions were scored based on their alignment with the hospital's protocol as follows: 0 for inappropriate, 1 for acceptable but non-matching, and 2 for matching the protocol. The experiment was conducted in both Japanese and English to compare GPT-4's performance across languages. Results: GPT-4 scored 27/40 points in English and 28/40 points in Japanese. GPT-4 gave inappropriate suggestions for Moyamoya disease and neuromyelitis optica in both languages, and for cerebral infarction in Japanese. For the other protocols, the suggested sequences were either appropriate or better. The suggestions in English differed from those in Japanese for seven protocols. Conclusion: GPT-4 can suggest appropriate MRI sequences for each disease in addition to the standard brain MRI protocol. GPT-4's output is language-dependent, and it suggests brain MRI protocols tailored to specific regions and domains.

15
Diagnostic accuracy of Clinical Radiology Reports for Trauma Radiographs: A retrospective validation study

Bruun, F. J.; Mueller, F. C.; Nybing, J. U.; Radev, D. I.; Honar, A. R.; Preuthun, J. R.; Laurberg, F. W.; Brejneboel, M. W.

2025-07-18 radiology and imaging 10.1101/2025.07.16.25331604 medRxiv
Top 0.1%
34.5%

Background/Purpose: Development and validation of AI tools for diagnostic imaging requires high-quality annotations. Dedicated research readings are considered superior to the clinical radiologic report (CRR). However, practices for generating CRRs vary in rigor. The purpose of this study was to validate CRRs produced using a post-conference multidisciplinary reader workflow by comparing them to dedicated research readings of trauma radiographs. Materials and Methods: This retrospective study included consecutive patients referred for radiography with an indication of trauma at two university hospitals. For the index test, the CRRs were evaluated based on their description of eight common diagnoses. The reference standard was established as the agreement of two certified reporting technologists, arbitrated by a senior musculoskeletal radiologist. Sensitivity, specificity, and positive and negative predictive values were calculated. Results: The study sample consisted of 618 consecutive examinations (median age 52 years, IQR 24-76; 351 female). Fracture incidence was 36%. Incidence of other findings ranged from 1% (bone lesions) to 10% (degenerative disease). The sensitivities of the CRRs were: fracture 97% [94 to 99%], luxation 87% [69 to 96%], degenerative disease 67% [53 to 78%], effusion 67% [46 to 83%], old fracture 64% [48 to 78%], subluxation 44% [14 to 79%], halisteresis 30% [13 to 53%], and bone lesion 25% [1 to 81%]. Specificity ranged from 94% [92 to 96%] for halisteresis to 100% [99 to 100%] for subluxation and luxation. Conclusion: We found that the workflow at the tested hospitals generates clinical radiologic reports of trauma radiographs with a diagnostic accuracy for the assessment of fracture and luxation suitable for research-quality labelling.

16
Large Language Models in Radiology Reporting - A Systematic Review of Performance, Limitations, and Clinical Implications

Artsi, Y.; Klang, E.; Collins, J. D.; Glicksberg, B. S.; Korfiatis, P.; Nadkarni, G.; Sorin, V.

2025-03-19 radiology and imaging 10.1101/2025.03.18.25324193 medRxiv
Top 0.1%
34.4%

Background: Large language models (LLMs) have emerged as potential tools for automated radiology reporting. However, concerns regarding their fidelity, reliability, and clinical applicability remain. This systematic review examines the current literature on LLM-generated radiology reports. Methods: We conducted a systematic search of MEDLINE, Google Scholar, Scopus, and Web of Science to identify studies published between January 2015 and February 2025. Studies evaluating LLM-generated radiology reports were included. The study follows PRISMA guidelines. Risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. Results: Nine studies met the inclusion criteria. Of these, six evaluated full radiology reports, while three focused on impression generation. Six studies assessed base LLMs, and three evaluated fine-tuned models. Fine-tuned models demonstrated better alignment with expert evaluations and achieved higher performance on natural language processing metrics compared to base models. All LLMs showed hallucinations, misdiagnoses, and inconsistencies. Conclusion: LLMs show promise in radiology reporting. However, limitations in diagnostic accuracy and hallucinations necessitate human oversight. Future research should focus on improving evaluation frameworks, incorporating diverse datasets, and prospectively validating AI-generated reports in clinical workflows.

17
Uncertainty Quantification of Central Canal Stenosis Deep Learning Classifier from Lumbar Sagittal T2-Weighted MRI

Brenzikofer, A.; Monzon, M.; Galbusera, F.; Manjaly, Z.-M.; Cina, A.; Jutzeler, C. R.

2025-10-25 radiology and imaging 10.1101/2025.10.24.25338153 medRxiv
Top 0.1%
34.4%

Background: Accurate assessment of the severity of central canal stenosis (CCS) on lumbar spine MRI is critical for clinical decision-making. We evaluated deep learning models for automated CCS grading on sagittal T2-weighted MRI, focusing on uncertainty quantification to improve clinical reliability. Methods: Using a retrospective cohort from the LumbarDISC dataset (1,974 patients), we compared multiple deep learning architectures for three-level CCS classification (normal/mild, moderate, severe). To assess model confidence, Monte Carlo (MC) dropout and test-time augmentation (TTA) techniques were applied to quantify prediction uncertainty. Results: The fine-tuned Spinal Grading Network (SGN) achieved a balanced accuracy of 79.4% and a macro F1 score of 68.8%, with per-class accuracies of 71.3% for moderate and 78.5% for severe stenosis. MC dropout revealed an increase in uncertainty predominantly in moderate and severe cases, while TTA uncertainty was higher for mild stenosis. Conclusion: DL-based CCS grading demonstrates potential to assist radiologists by providing rapid, standardized evaluations. Incorporating uncertainty quantification offers a safeguard to flag ambiguous cases, thus supporting clinical trust and facilitating safer integration of AI tools into the interpretation of spine MRI.
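
Monte Carlo dropout, one of the two uncertainty techniques named above, keeps dropout active at inference and treats the spread of repeated stochastic predictions as an uncertainty signal. A generic PyTorch sketch of the idea, using a stand-in classifier rather than the authors' SGN model:

    # Generic Monte Carlo dropout sketch (not the authors' SGN): keep dropout active
    # at inference, run several stochastic forward passes, and use the predictive
    # entropy of the averaged softmax as an uncertainty estimate.
    import torch
    import torch.nn as nn

    model = nn.Sequential(              # stand-in classifier for a 3-class grading task
        nn.Flatten(),
        nn.Linear(64 * 64, 128), nn.ReLU(), nn.Dropout(p=0.3),
        nn.Linear(128, 3),
    )

    def mc_dropout_predict(model, x, passes=30):
        model.train()                   # keeps Dropout stochastic; acceptable here since there is no BatchNorm
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
        mean_probs = probs.mean(dim=0)
        entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
        return mean_probs, entropy      # higher entropy = more uncertain prediction

    x = torch.randn(1, 1, 64, 64)       # dummy input standing in for a sagittal slice
    mean_probs, uncertainty = mc_dropout_predict(model, x)
    print(mean_probs, uncertainty)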

18
Systematic review of natural language processing (NLP) applications in magnetic resonance imaging (MRI)

Mahameed, G.; Brin, D.; Konen, E.; Nadkarni, G.; Klang, E.

2024-07-21 radiology and imaging 10.1101/2024.07.21.24310760 medRxiv
Top 0.1%
34.1%

Background: As MRI use grows in medical diagnostics, applying NLP techniques could improve the management of related text data. This review aims to explore how NLP can augment radiological evaluations in MRI. Methods: We conducted a PubMed search for studies that applied NLP in the clinical analysis of MRI, including publications up to January 4, 2024. The quality and potential bias of the included studies were assessed using the QUADAS-2 tool. Results: Twenty-six studies published between April 2010 and January 2024, covering more than 160k MRI reports, were analyzed. Most of these studies demonstrated low to no risk of bias. Neurology was the most frequently studied specialty, with twelve studies, followed by musculoskeletal (MSK) and body imaging. Applications of NLP included staging, quantification, and disease diagnosis. Notably, NLP showed high precision in tumor staging classification and in structuring free-text reports. Conclusion: NLP shows promise in enhancing the utility of MRI. However, there is a need for prospective studies to further validate NLP algorithms in real-time clinical and operational scenarios and across various radiology specialties, which could lead to broader applications in healthcare.

19
Distinguishing GPT-4-generated Radiology Abstracts from Original Abstracts: Performance of Blinded Human Observers and AI Content Detector

Ufuk, F.; Peker, H.; Sagtas, E.; Yagci, A. B.

2023-05-03 radiology and imaging 10.1101/2023.04.28.23289283 medRxiv
Top 0.1%
34.1%

Objective: To determine GPT-4's effectiveness in writing scientific radiology article abstracts and to investigate human reviewers' and AI content detectors' success in distinguishing these abstracts. Additionally, to determine the similarity scores of abstracts generated by GPT-4 to better understand its ability to create unique text. Methods: The study collected 250 original articles published between 2021 and 2023 in five radiology journals. The articles were randomly selected, and their abstracts were generated by GPT-4 using a specific prompt. Three experienced academic radiologists independently evaluated the GPT-4-generated and original abstracts to classify them as original or generated by GPT-4. All abstracts were also uploaded to an AI content detector and a plagiarism detector to calculate similarity scores. Statistical analysis was performed to determine discrimination performance and similarity scores. Results: Of 134 GPT-4-generated abstracts, an average of 75 (56%) were detected by the reviewers, and an average of 50 (43%) original abstracts were falsely categorized as GPT-4-generated by the reviewers. The sensitivity, specificity, accuracy, PPV, and NPV of observers in distinguishing GPT-4-written abstracts ranged from 51.5% to 55.6%, 56.1% to 70%, 54.8% to 60.8%, 41.2% to 76.7%, and 47% to 62.7%, respectively. No significant difference in discrimination performance was observed between observers. Conclusion: GPT-4 can generate convincing scientific radiology article abstracts. However, human reviewers and AI content detectors have difficulty distinguishing GPT-4-generated abstracts from original ones.

20
Assessing Performance of Multimodal ChatGPT-4 on an image based Radiology Board-style Examination: An exploratory study

Bera, K.; Gupta, A.; Jiang, S.; Berlin, S.; Faraji, N.; Tippareddy, C.; Chiong, I.; Jones, R.; Nemer, O.; Nayate, A.; Tirumani, S. H.; Ramaiya, N.

2024-01-13 radiology and imaging 10.1101/2024.01.12.24301222 medRxiv
Top 0.1%
29.7%

Objective: To evaluate the performance of multimodal ChatGPT-4 on a radiology board-style examination containing text and radiologic images. Materials and Methods: In this prospective exploratory study, conducted from October 30 to December 10, 2023, 110 multiple-choice questions containing images, designed to match the style and content of radiology board examinations such as the American Board of Radiology Core or Canadian Board of Radiology examination, were prompted to multimodal ChatGPT-4. Questions were further substratified into lower-order (recall, understanding) and higher-order (analyze, synthesize) questions, and by domain (radiology subspecialty), imaging modality, and difficulty (rated by both radiologists and radiologists-in-training). ChatGPT performance was assessed overall as well as in subcategories using Fisher's exact test with multiple comparisons. Confidence in answering questions was assessed using a Likert scale (1-5) by consensus between a radiologist and a radiologist-in-training. Reproducibility was assessed by comparing two different runs using two different accounts. Results: ChatGPT-4 answered 55% (61/110) of the image-rich questions correctly. While there was no significant difference in performance among the various subgroups on exploratory analysis, performance was better on lower-order questions [61% (25/41)] than on higher-order questions [52% (36/69)] [P = .46]. Among clinical domains, performance was best on cardiovascular imaging [80% (8/10)] and worst on thoracic imaging [30% (3/10)]. Confidence in answering questions was confident/highly confident [89% (98/110)], even when incorrect. There was poor reproducibility between the two runs, with answers differing for 14% (15/110) of questions. Conclusion: Despite no radiology-specific pre-training, the multimodal capabilities of ChatGPT appear promising on questions containing images. However, the lack of reproducibility between two runs, even with the same questions, poses reliability challenges.